Model Selection and Validation

As with Bayesian inference, model selection and validation are fundamental steps in statistical learning applications. In particular, we wish to select the model that performs optimally, both with respect to the training data and to external data.

Depending on the type of learning method we use, we may be interested in one or more of the following:

  • how many variables should be included in the model?
  • what hyperparameter values should be used in fitting the model?
  • how many groups should we use to cluster our data?

Givens and Hoeting (2012) includes a dataset on salmon spawning success. If we plot the number of recruits against the number of spawners, we see a distinct positive relationship, as we would expect. The question is: what order of polynomial best describes this relationship?


In [ ]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook')
import warnings
warnings.simplefilter("ignore")

salmon = pd.read_table("../data/salmon.dat", sep=r'\s+', index_col=0)
salmon.plot(x='spawners', y='recruits', kind='scatter')

At one extreme, a linear fit underfits the data; at the other, including a very large number of polynomial terms clearly overfits it.


In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(14,6))

xvals = np.arange(salmon.spawners.min(), salmon.spawners.max())

fit1 = np.polyfit(salmon.spawners, salmon.recruits, 1)
p1 = np.poly1d(fit1)
axes[0].plot(xvals, p1(xvals))
axes[0].scatter(x=salmon.spawners, y=salmon.recruits)

fit15 = np.polyfit(salmon.spawners, salmon.recruits, 15)
p15 = np.poly1d(fit15)
axes[1].plot(xvals, p15(xvals))
axes[1].scatter(x=salmon.spawners, y=salmon.recruits)

We can select an appropriate polynomial order for the model using cross-validation, in which we hold out a testing subset from our dataset, fit the model to the remaining data, and evaluate its performance on the held-out subset.


In [ ]:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(salmon.spawners, 
                            salmon.recruits, test_size=0.3)

A natural criterion for evaluating model performance is the root mean squared error (RMSE).


In [ ]:
def rmse(x, y, coefs):
    yfit = np.polyval(coefs, x)
    return np.sqrt(np.mean((y - yfit) ** 2))

We can now fit models of varying polynomial degree and compare their performance on the training and testing sets.


In [ ]:
degrees = np.arange(14)
train_err = np.zeros(len(degrees))
validation_err = np.zeros(len(degrees))

for i, d in enumerate(degrees):
    p = np.polyfit(xtrain, ytrain, d)

    train_err[i] = rmse(xtrain, ytrain, p)
    validation_err[i] = rmse(xtest, ytest, p)

fig, ax = plt.subplots()

ax.plot(degrees, validation_err, lw=2, label = 'cross-validation error')
ax.plot(degrees, train_err, lw=2, label = 'training error')

ax.legend(loc=0)
ax.set_xlabel('degree of fit')
ax.set_ylabel('rms error')

In the cross-validation above, notice that the testing error is high for both very low and very high polynomial degrees, while training error declines monotonically with degree. The cross-validation error is composed of two components: bias and variance. When a model is underfit, bias is high but variance is low; when a model is overfit, the reverse is true.

One can show that the MSE decomposes into a sum of the bias (squared) and variance of the estimator:

$$\begin{aligned} \text{Var}(\hat{\theta}) &= E[(\hat{\theta} - \theta)^2] - (E[\hat{\theta} - \theta])^2 \\ \Rightarrow E[(\hat{\theta} - \theta)^2] &= \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2 \end{aligned}$$

The training error, on the other hand, does not have this tradeoff; it will always decrease (or at least, never increase) as variables (polynomial terms) are added to the model.
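
To make the decomposition concrete, the following small simulation (a sketch that is not part of the original example; the quadratic test function, noise level, and evaluation point are arbitrary choices) repeatedly refits polynomials of different degree to noisy data and estimates the squared bias and variance of the prediction at a single point. An underfit (linear) model shows high bias, while a high-degree model shows high variance.


In [ ]:
rng = np.random.default_rng(42)
x = np.linspace(0, 1, 30)
f = x + x**2                      # true mean function (quadratic)
x0, f0 = 0.5, 0.5 + 0.5**2        # evaluation point and its true value

for d in (1, 2, 10):
    # refit the degree-d polynomial to 500 noisy draws and record the prediction at x0
    preds = np.array([np.polyval(np.polyfit(x, f + rng.normal(scale=0.2, size=len(x)), d), x0)
                      for _ in range(500)])
    print(f"degree {d}: bias^2 = {(preds.mean() - f0)**2:.4f}, variance = {preds.var():.4f}")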

Information-theoretic Model Selection

One approach to model selection is to use an information-theoretic criterion to identify the most appropriate model. Akaike (1973) found a formal relationship between Kullback-Leibler information (a dominant paradigm in information and coding theory) and likelihood theory. Akaike's Information Criterion (AIC) is an estimator of expected relative K-L information based on the maximized log-likelihood function, corrected for asymptotic bias.

$$\text{AIC} = -2 \log(L(\theta|\text{data})) + 2k$$

AIC balances the fit of the model (in terms of the likelihood) with the number of parameters required to achieve that fit. We can easily calculate AIC from the residual sums of squares as:

$$\text{AIC} = n \log(\text{RSS}/n) + 2k$$

where $k$ is the number of parameters in the model. Notice that as the number of parameters increases, the residual sum of squares goes down, but the second term (a penalty) increases.

To apply AIC to a model selection problem, we choose the model that has the lowest AIC value.

AIC can be shown to be asymptotically equivalent to leave-one-out cross-validation.


In [ ]:
aic = lambda rss, n, k: n*np.log(float(rss)/n) + 2*k
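
As a quick illustration (a sketch, not part of the original analysis), we can use this helper to score the salmon polynomial fits of increasing degree; here $k$ is taken to be the number of polynomial coefficients (degree + 1), though conventions differ on whether the error variance should also be counted as a parameter.


In [ ]:
x_all, y_all = salmon.spawners.values, salmon.recruits.values
n = len(salmon)

for d in range(1, 6):
    coefs = np.polyfit(x_all, y_all, d)
    rss = np.sum((y_all - np.polyval(coefs, x_all)) ** 2)   # residual sum of squares
    print(d, aic(rss, n, d + 1))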

As an example, consider the body fat dataset that is used in Chapter 12 of Givens and Hoeting (2012). It measures the percentage of body fat for 251 men, estimated using an underwater weighing technique. In addition, the variables age, weight, height, and ten body circumference measurements were recorded for each subject.


In [ ]:
bodyfat = pd.read_table("../data/bodyfat.dat", sep=r'\s+')
bodyfat.head()

To illustrate model selection, we will consider 5 competing models consisting of different subsets of available covariates.


In [ ]:
subsets = [['weight', 'height', 'neck', 'chest', 'abd', 'hip', 'thigh', 
                            'knee', 'ankle', 'biceps'],
           ['weight', 'height', 'neck', 'chest', 'abd', 'hip', 'thigh', 
                            'knee'],
           ['weight', 'height', 'neck', 'chest', 'abd', 'hip'],
           ['weight', 'height', 'neck'],
           ['weight']]

In [ ]:
import statsmodels.api as sm

fit = sm.OLS(bodyfat['fat'], sm.add_constant(bodyfat[subsets[3]])).fit()
fit.summary()

In [ ]:
k0 = len(subsets[0])
aic_values = np.zeros(len(subsets))
params = np.zeros((len(subsets), k0))

for i, s in enumerate(subsets):
    x = bodyfat[s]
    y = bodyfat['fat']
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    aic_values[i] = fit.aic
    params[i, :len(s)] = fit.params.values[1:]  # skip the intercept (first element with add_constant)

plt.plot(aic_values, 'ro')
plt.xlabel('model')
plt.gca().set_xticks(np.arange(5))
plt.ylabel('AIC')

In [ ]:
p_best = params[np.where(aic_values==aic_values.min())]
p_best.round(2)

In [ ]:
aic_values

For ease of interpretation, AIC values can be transformed into model weights. Defining $\Delta\text{AIC}_i = \text{AIC}_i - \text{AIC}_{\min}$, the weight for model $i$ is:

$$w_i = \frac{\exp(-\frac{1}{2} \Delta\text{AIC}_i)}{\sum_{m=1}^M \exp(-\frac{1}{2} \Delta\text{AIC}_m)}$$

In [ ]:
aic_trans = np.exp(-0.5*(aic_values - aic_values.min()))
aic_probs = aic_trans/aic_trans.sum()
aic_probs.round(2)

For some problems, we can use AIC weights to perform multimodel inference, whereby we use model weights to calculate model-averaged parameter estimates, thereby accounting for model selection uncertainty.


In [ ]:
p_weighted = ((params.T * aic_probs).T).sum(0)
p_weighted.round(2)

K-fold Cross-validation

In k-fold cross-validation, the training set is split into k smaller sets. Then, for each of the k "folds":

  1. a model is trained using k-1 of the folds as training data
  2. the resulting model is validated on the remaining fold, using an appropriate metric

The performance measure reported by k-fold CV is then the average of the k computed values. This approach can be computationally expensive, but does not waste too much data, which is an advantage over having a fixed test subset.
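
Before stepping through the folds by hand below, here is a compact sketch of the same procedure (not in the original notebook; it assumes a scikit-learn version recent enough to provide the 'neg_root_mean_squared_error' scorer) using a polynomial pipeline with cross_val_score:


In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x_col = salmon.spawners.values[:, None]   # 2-D feature matrix, as scikit-learn expects
y_obs = salmon.recruits.values

for d in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    scores = cross_val_score(model, x_col, y_obs, cv=5,
                             scoring='neg_root_mean_squared_error')
    print(d, -scores.mean())   # average RMSE over the 5 folds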


In [ ]:
from sklearn.model_selection import KFold

nfolds = 5
fig, axes = plt.subplots(1, nfolds, figsize=(14,4))
x_all, y_all = salmon.spawners.values, salmon.recruits.values
kf = KFold(n_splits=nfolds, shuffle=True)
for i, (training, validation) in enumerate(kf.split(salmon)):
    axes[i].plot(x_all[training], y_all[training], 'ro')
    axes[i].plot(x_all[validation], y_all[validation], 'bo')
plt.tight_layout()

In [ ]:
k = 5
degrees = np.arange(8)
k_fold_err = np.empty(len(degrees))

x_all, y_all = salmon.spawners.values, salmon.recruits.values

for i, d in enumerate(degrees):
    
    error = np.empty(k)
    
    for j, (training, validation) in enumerate(KFold(n_splits=k).split(salmon)):

        x_train, y_train = x_all[training], y_all[training]
        x_test, y_test = x_all[validation], y_all[validation]
        
        p = np.polyfit(x_train, y_train, d)
        
        error[j] = rmse(x_test, y_test, p)

    k_fold_err[i] = error.mean()
        

fig, ax = plt.subplots()

ax.plot(degrees, k_fold_err, lw=2)
ax.set_xlabel('degree of fit')
ax.set_ylabel('average rms error')

If the model shows high bias, the following actions might help:

  • Add more features. In the body fat example above, for instance, incorporating additional body measurements as covariates in the training and test sets can improve a high-bias estimator.
  • Use a more sophisticated model. Adding complexity to the model can help improve on bias. For a polynomial fit, this can be accomplished by increasing the degree d. Each learning technique has its own methods of adding complexity.
  • Decrease regularization. Regularization is a technique used to impose simplicity in some machine learning models, by adding a penalty term that depends on the characteristics of the parameters. If a model has high bias, decreasing the effect of regularization can lead to better results.

If the model shows high variance, the following actions might help:

  • Use fewer features. Using a feature selection technique may be useful, and decrease the over-fitting of the estimator.
  • Use a simpler model. Model complexity and over-fitting go hand-in-hand.
  • Use more training samples. Adding training samples can reduce the effect of over-fitting, and lead to improvements in a high variance estimator.
  • Increase regularization. Regularization is designed to prevent over-fitting. In a high-variance model, increasing regularization can lead to better results.

Bootstrap aggregating regression

Splitting datasets into training, cross-validation and testing subsets is inefficient, particularly when the original dataset is not large. As an alternative, we can use bootstrapping to both develop and validate our model without dividing our dataset. One algorithm to facilitate this is the bootstrap aggregation (or bagging) algorithm.

A bagging regressor is an ensemble meta-estimator that fits base regressors on random subsets of the original dataset and then aggregates their individual predictions (by averaging, for regression) to form a final prediction.
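
To see what the bagging procedure does under the hood, here is a hand-rolled sketch (not part of the original notebook; the degree-2 polynomial and the 50 bootstrap resamples are arbitrary choices) that fits a polynomial to bootstrap resamples of the salmon data and averages the resulting prediction curves:


In [ ]:
rng = np.random.default_rng(1)
x_all, y_all = salmon.spawners.values, salmon.recruits.values
xgrid = np.linspace(x_all.min(), x_all.max(), 100)

preds = []
for _ in range(50):
    idx = rng.integers(0, len(x_all), len(x_all))      # bootstrap resample (with replacement)
    coefs = np.polyfit(x_all[idx], y_all[idx], 2)
    preds.append(np.polyval(coefs, xgrid))

plt.scatter(x_all, y_all)
plt.plot(xgrid, np.mean(preds, axis=0), 'r', lw=2)     # bagged (averaged) prediction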


In [ ]:
from sklearn.ensemble import BaggingRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X, y = salmon.spawners.values, salmon.recruits.values
br = BaggingRegressor(LinearRegression(), oob_score=True)
X2 = PolynomialFeatures(degree=2).fit_transform(X[:, None])
br.fit(X2, y)

In order to evaluate a particular model, the samples that were not selected for a particular resampled dataset (the out-of-bag sample) can be used to estimate the generalization error.


In [ ]:
br.oob_score_

In [ ]:
scores = []

for d in degrees:
    
    Xd = PolynomialFeatures(degree=d).fit_transform(X[:, None])
    br = BaggingRegressor(LinearRegression(), oob_score=True)
    br.fit(Xd, y)
    scores.append(br.oob_score_)

In [ ]:
plt.plot(scores)

Regularization

The scikit-learn package includes a built-in dataset of diabetes progression, taken from Efron et al. (2003), which includes a set of 10 normalized predictors.


In [ ]:
from sklearn import datasets

# Predictors: "age" "sex" "bmi" "map" "tc"  "ldl" "hdl" "tch" "ltg" "glu"
diabetes = datasets.load_diabetes()

Let's examine how a linear regression model performs across a range of sample sizes.


In [ ]:
diabetes['data'].shape

In [ ]:
from sklearn import linear_model
from sklearn.model_selection import cross_val_score, ShuffleSplit

def plot_learning_curve(estimator, label=None):
    scores = list()
    train_sizes = np.linspace(10, 200, 10).astype(int)
    for train_size in train_sizes:
        test_error = cross_val_score(estimator, diabetes['data'], diabetes['target'],
                                     cv=ShuffleSplit(train_size=train_size,
                                                     test_size=200,
                                                     random_state=0))
        scores.append(test_error)

    plt.plot(train_sizes, np.mean(scores, axis=1), label=label or estimator.__class__.__name__)
    plt.ylim(0, 1)
    plt.ylabel('Explained variance on test set')
    plt.xlabel('Training set size')
    plt.legend(loc='best')

In [ ]:
plot_learning_curve(linear_model.LinearRegression())

Notice that linear regression is not defined when the number of features/parameters exceeds the number of observations, and it performs poorly unless the number of samples is several times the number of features.

One approach for dealing with overfitting is to regularize the regression model.

The ridge estimator is a simple, computationally efficient regularization for linear regression.

$$\hat{\beta}^{ridge} = \text{argmin}_{\beta}\left\{\sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^k x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^k \beta_j^2 \right\}$$

Typically, we are not interested in shrinking the intercept, and the covariates are standardized to have zero mean and unit variance before fitting. Hence,

$$\hat{\beta}^{ridge} = \text{argmin}_{\beta} \sum_{i=1}^N (y_i - \sum_{j=1}^k x_{ij} \beta_j)^2$$$$\text{subject to } \sum_{j=1}^k \beta_j^2 < \lambda$$

Note that this is equivalent to the posterior mode in a Bayesian model $y \sim N(X\beta, I)$ with a Gaussian prior on the $\beta_j$ whose variance is inversely proportional to $\lambda$:

$$\beta_j \sim \text{N}(0, \lambda^{-1})$$

The estimator for the ridge regression model is:

$$\hat{\beta}^{ridge} = (X'X + \lambda I)^{-1}X'y$$
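
As a sanity check on the closed form above, this sketch (not in the original notebook; it assumes standardized predictors, a centered response, no intercept, and an arbitrary $\lambda = 1$) compares the matrix solution with scikit-learn's Ridge:


In [ ]:
from sklearn import preprocessing, linear_model

lam = 1.0
Xs = preprocessing.scale(diabetes['data'])
yc = diabetes['target'] - diabetes['target'].mean()

# closed-form ridge solution: (X'X + lambda*I)^{-1} X'y
beta_closed = np.linalg.solve(Xs.T @ Xs + lam * np.eye(Xs.shape[1]), Xs.T @ yc)
beta_sklearn = linear_model.Ridge(alpha=lam, fit_intercept=False).fit(Xs, yc).coef_

np.allclose(beta_closed, beta_sklearn)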

In [ ]:
from sklearn import preprocessing

k = diabetes['data'].shape[1]
alphas = np.linspace(0, 4)
params = np.zeros((len(alphas), k))
for i, a in enumerate(alphas):
    X = preprocessing.scale(diabetes['data'])
    y = diabetes['target']
    # X is already standardized above, so the deprecated `normalize` argument is not needed
    fit = linear_model.Ridge(alpha=a).fit(X, y)
    params[i] = fit.coef_

plt.figure(figsize=(14,6))
for param in params.T:
    plt.plot(alphas, param)

In [ ]:
plot_learning_curve(linear_model.LinearRegression())
plot_learning_curve(linear_model.Ridge())

Notice that at very small sample sizes, the ridge estimator outperforms the unregularized model.

The regularization of the ridge is a shrinkage: the coefficients learned are shrunk towards zero.

The amount of regularization is set via the alpha parameter of the ridge, which is tunable. The RidgeCV estimator in scikit-learn automatically tunes this parameter via cross-validation.


In [ ]:
for a in [0.001, 0.01, 0.1, 1, 10]:
    plot_learning_curve(linear_model.Ridge(a), a)

In [ ]:
plot_learning_curve(linear_model.LinearRegression())
plot_learning_curve(linear_model.Ridge())
plot_learning_curve(linear_model.RidgeCV())

The Lasso estimator is useful for imposing sparsity on the coefficients. In other words, it is to be preferred if we believe that many of the features are not relevant.

$$\hat{\beta}^{lasso} = \text{argmin}_{\beta}\left\{\frac{1}{2}\sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^k x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^k |\beta_j| \right\}$$

or, similarly:

$$\hat{\beta}^{lasso} = \text{argmin}_{\beta} \frac{1}{2}\sum_{i=1}^N (y_i - \sum_{j=1}^k x_{ij} \beta_j)^2$$$$\text{subject to } \sum_{j=1}^k |\beta_j| < \lambda$$

Note that this is equivalent to a Bayesian model $y \sim N(X\beta, I)$ with a Laplace prior on the $\beta_j$:

$$\beta_j \sim \text{Laplace}(\lambda) = \frac{\lambda}{2}\exp(-\lambda|\beta_j|)$$

Note how the Lasso imposes sparseness on the parameter coefficients:


In [ ]:
k = diabetes['data'].shape[1]
alphas = np.linspace(0.1, 3)
params = np.zeros((len(alphas), k))
for i, a in enumerate(alphas):
    X = preprocessing.scale(diabetes['data'])
    y = diabetes['target']
    # X is already standardized, so the deprecated `normalize` argument is not needed
    fit = linear_model.Lasso(alpha=a).fit(X, y)
    params[i] = fit.coef_

plt.figure(figsize=(14,6))
for param in params.T:
    plt.plot(alphas, param)

In [ ]:
plot_learning_curve(linear_model.RidgeCV())
plot_learning_curve(linear_model.Lasso(0.05))

In this example, the ridge estimator performs better than the lasso overall; with fewer observations the lasso matches its performance, but otherwise the variance-reducing effect of the lasso regularization does not compensate for the increase in bias.

With the lasso, too, we must tune the regularization parameter for good performance. There is a corresponding LassoCV estimator in scikit-learn, but it is computationally expensive. To speed it up, we can reduce the number of values explored for the alpha parameter.


In [ ]:
plot_learning_curve(linear_model.RidgeCV())
plot_learning_curve(linear_model.LassoCV(n_alphas=10, max_iter=5000))

Can't decide? ElasticNet is a compromise between lasso and ridge regression.

$$\hat{\beta}^{elastic} = \text{argmin}_{\beta}\left\{\frac{1}{2}\sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^k x_{ij} \beta_j)^2 + (1 - \alpha) \sum_{j=1}^k \beta^2_j + \alpha \sum_{j=1}^k |\beta_j| \right\}$$

where $\alpha = \lambda_1/(\lambda_1 + \lambda_2)$. Its tuning parameter $\alpha$ (l1_ratio in scikit-learn) controls this mixture: when set to 0, ElasticNet is a ridge regression; when set to 1, it is a lasso. The sparser we expect the coefficients to be, the higher we should set $\alpha$.

Note that $\alpha$ can also be set by cross-validation, though it is computationally costly.


In [ ]:
plot_learning_curve(linear_model.RidgeCV())
plot_learning_curve(linear_model.ElasticNetCV(l1_ratio=.7, n_alphas=10))

Using Cross-validation for Parameter Tuning


In [ ]:
lasso = linear_model.Lasso()

In [ ]:
alphas = np.logspace(-4, -1, 20)

scores = np.empty(len(alphas))
scores_std = np.empty(len(alphas))

for i,alpha in enumerate(alphas):
    lasso.alpha = alpha
    s = cross_val_score(lasso, diabetes.data, diabetes.target, n_jobs=-1)
    scores[i] = s.mean()
    scores_std[i] = s.std()

In [ ]:
plt.semilogx(alphas, scores)
plt.semilogx(alphas, np.array(scores) + np.array(scores_std)/20, 'b--')
plt.semilogx(alphas, np.array(scores) - np.array(scores_std)/20, 'b--')
plt.yticks(())
plt.ylabel('CV score')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.text(5e-2, np.max(scores)+1e-4, str(np.max(scores).round(3)))

Model Checking using Learning Curves

A useful way of checking model performance (in terms of bias and/or variance) is to plot learning curves, which illustrate the learning process as your model is exposed to more data. When the dataset is small, it is easy for a model of a given complexity to fit the training data well, and as the dataset grows we expect the training score to decrease (training error to increase). Conversely, a relatively small dataset means that the model will not generalize well, and hence the cross-validation score will be lower, on average.


In [ ]:
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(lasso, 
                                                         diabetes.data, diabetes.target, 
                                                         train_sizes=[50, 70, 90, 110, 130], cv=5)

In [ ]:
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()

plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1,
                 color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
         label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
         label="Cross-validation score")
plt.legend(loc="best")

For models with high bias, training and cross-validation scores will tend to converge at a low value (high error), indicating that adding more data will not improve performance.

For models with high variance, there may be a gap between the training and cross-validation scores, suggesting that model performance could be improved with additional training data.


In [ ]:
X, y = salmon.spawners.values, salmon.recruits.values
X2 = PolynomialFeatures(degree=2).fit_transform(X[:, None])

train_sizes, train_scores, test_scores = learning_curve(linear_model.LinearRegression(), 
                                            X2, y, 
                                            train_sizes=[10, 15, 20, 30], cv=5)

In [ ]:
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()

plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1,
                 color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
         label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
         label="Cross-validation score")
plt.legend(loc="best")

Exercise: Very low birthweight infants

Compare logistic regression models (using the linear_model.LogisticRegression interface) with varying degrees of regularization for the VLBW infant database, using a relevant metric such as the Brier score:

$$B = \frac{1}{n} \sum_{i=1}^n (\hat{p}_i - y_i)^2$$
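
As a starting point (a sketch; the helper name is our own), the Brier score defined above can be computed as:


In [ ]:
def brier_score(p_hat, y):
    """Mean squared difference between predicted probabilities and binary outcomes."""
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((p_hat - y) ** 2)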

In [ ]:
vlbw = pd.read_csv("../data/vlbw.csv", index_col=0)

vlbw = vlbw.replace({'inout':{'born at Duke':0, 'transported':1},
             'delivery':{'abdominal':0, 'vaginal':1},
             'ivh':{'absent':0, 'present':1, 'possible':1, 'definite':1},
             'sex':{'female':0, 'male':1}})

vlbw = vlbw[[u'birth', u'exit', u'hospstay', u'lowph', u'pltct', 
      u'bwt', u'gest', u'meth', 
      u'toc', u'delivery', u'apg1', u'vent', u'pneumo', u'pda', u'cld', 
      u'ivh']].dropna()

In [ ]:
# Write your answer here

References

Burnham, K. P., & Anderson, D. R. (2002). Model Selection and Multi-Model Inference: A Practical, Information-theoretic Approach. Springer Verlag.

Givens, G. H., & Hoeting, J. A. (2012). Computational Statistics (Wiley Series in Computational Statistics). Wiley.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning. Springer Verlag.

Vanderplas, J. Scikit-learn tutorials for the Scipy 2013 conference.